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CNJ Abstract 

Gibbs random fields play an important role in statistics, for example the autologistic 
model is commonly used to model the spatial distribution of binary variables defined 
on a lattice. However they are complicated to work with due to an intractability of the 
likelihood function. It is therefore natural to consider tractable approximations to the 
likelihood function. Composite likelihoods offer a principled approach to constructing 
such approximation. The contribution of this paper is to examine the performance of a 
i— i collection of composite likelihood approximations in the context of Bayesian inference. 
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l^j 1 Introduction 

j> Gibbs random fields play an important and varied role in statistics. The autologistic model 

is used to model the spatial distribution of binary random variables defined on a lattice or 



grid (Besag 1974). The exponential random graph model or p* model is arguably the most 



popular statistical model in social network analysis (Robins et al 2007). Other application 



!>■ areas include biology, ecology and physics. 

Despite their popularity, Gibbs random fields present considerable difficulties from the 
point of view of parameter estimation, because the likelihood function is typically intractable 
for all but trivially small graphs. One of the earliest approaches to overcome this difficulty is 



the pseudolikelihood method (Besag 1972), which approximates the joint likelihood function 
by the product of full-conditional distributions of all nodes. Indeed the pseudolikelihood 
approximation is an example of a composite likelihood, that is, a likelihood approximation 
consisting of a product of a joint distribution of a lower dimensional variables, each of which 
can be normalised. It is natural to consider approximations which refine pseudolikelihood 
by considering products of larger collections of variables. The purpose of this paper is to 
consider such composite likelihood approximations. A similar study has been conducted by 



Okabayshi et al. (2011), although from a likelihood inference perspective. As in this paper, 



they consider likelihood approximations consisting of a product of a joint distribution of 
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collections of neighbouring variables. Using the recursion method of (Reeves and Pettitt 



2004) we show that larger collections of variables can be used. 



This paper is organised as follows. Section [2] outlines a description of Gibbs random 
fields, and in particular the autologistic distribution. Composite likelihoods are introduced in 
Section [3j Here we focus especially on how to formulate conditional composite likelihoods for 
application to the autologistic model. We also focus on the issue of calibrating the composite 
likelihood function for use in a Bayesian context. Section [4] illustrates the performance of the 
various estimators for simulated data. The paper concludes with some remarks in Section [5j 



2 Discrete- valued Markov random fields 

Discrete Markov random fields play an important role in several areas of statistics including 
spatial statistics and social network analysis. The autologistic model, popularised by Besag 



(1972) which has the Ising model as a special case, is widely used in analysis of binary spatial 



data defined on a lattice. The exponential random graph (or p*) model is frequently used 



to model relational network data. See (Robins et al 2007) for an excellent introduction to 
this body of work. 

Let y = {yi, . . . , y n } denote realised data defined on a set of nodes {1, . . . , n} of a graph, 
where each observation y^ takes values from some finite state space. The likelihood of y given 
a vector of parameters 9 = (9i, . . . , 9 m ) is defined as 

f(y\9) oc exp(e T s(y)) := q(y\6), (1) 

where s(y) = (si(y), . . . , s m (y)) is a vector of sufficient statistics. The constant of propor- 
tionality in ([!]), 

z(9) = J2^M0 T s(y)), 

y&Y 

depends on the parameters 9, and is a summation over all possible realisation of the Gibbs 
random field. Clearly, z(9) is intractable for all but trivially small situations. This poses 
serious difficulties in terms of estimating the parameter vector 9. 

One of the earliest approaches to overcome the intractability of the ([I]) is the pseudolike- 



lihood method (Besag 1975) which approximates the joint distribution of y as the product 



of full-conditional distributions for each y i} 

n 

f P seudo(y) = Y[f(yi\y-i,9), 



i=l 



where y-i denotes y\{y%}- This approximation has been shown to lead to unreliable estimates 
of 9, for example, ( |Ryden and Titterington 1998 ), (Friel et al 2009). This is in fact one of the 



earliest composite likelihood approximations, and we will outline work in this area further in 
Section |3j Note also that Monte Carlo approaches have also been exploited to estimate the 
intractable likelihood, for example the Monte Carlo maximum likelihood estimator of Geyer 
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and Thompson (1992). More recently, auxiliary variable approaches have been presented to 



tackle this problem through the single auxiliary variable method (M0ller et al 2006) and the 



exchange algorithm (Murray et al 2006) 



The autologistic model, first proposed by Besag (1972), is defined on a regular lattice of 
size mx m', where n = mm'. It is used to model the spatial distribution of binary variables, 
taking values —1,1. The autologistic model is defined in terms of two sufficient statistics, 



i=l j=l i~j 

where the notation i ~ j means that lattice point i is a neighbour of lattice point j. Hence- 
forth we assume that the lattice points have been indexed from top to bottom in each column 
and where columns are ordered from left to right. For example, for a first order neighbour- 
hood model where an interior point %ji has neighbours {yi- m ,yi-i,yi+i,yi+ m }- Along the 
edges of the lattice each point has either 2 or 3 neighbours. The full-conditional of yi can be 
written as 

p(Vi\y-i, 8) oc exp (fioyi + QiVi{y % - m + Vi-i + Vi+i + Vi+ m )) , (2) 
where y-i denotes y excluding yi. As before, the conditional distribution is modified along the 



edges of the lattice. The Hammersley-Clifford theorem (Besag 1974) shows the equivalence 



between the model defined in ^ and in ([I]). Note that parameter 9q controls the relative 
abundance of —1 and +1 values. The parameter Q\ controls the level of spatial aggregation. 
When 9 > 0, neighbouring values tend to take similar values, thereby yielding homogeneous 
regions in the lattice. Note that the Ising model is a special case, resulting from 8q = 0. 



3 Composite likelihoods 



There has been considerable recent interest in composite likelihood methods in the statistics 



literature under such headings as pairwise likelihood methods (Nott and Ryden 1999), com- 



posite likelihoods (Heagerty and Lele 1998), (Cox and Reid 2004) and split-data likelihoods 



(Ryden 1994). These concepts have long-standing antecedents such as Besag's pseudolikeli- 



hood (Besag 1975). The basic idea is to work with a likelihood made up of factors, each of 



which corresponds to the joint probability function of a small number of variables, two in 
the case of pairwise likelihoods. 

Let us know return to the case where y is a realisation from an autologistic distribution. 
Following the previous section we denote A = {1, . . . ,mm'} as an index set for the lattice 



points. Following (Asuncion et al 2010) we consider a general form of composite likelihood 
written as 



c 



f{y\9)^J[p{yA^e) :=CL{y\9). 



(3) 



Some special cases arise: 



1. Ai = A, Bi = 0, C = 1 corresponds to the full likelihood. 
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2. Bi = is often termed marginal composite likelihood. 

3. Bi = A \ Ai is often termed conditional composite likelihood. 

The focus of this paper is on conditional composite likelihoods. Note that the pseudolike- 
lihood approximation is a special of 3. where each Ai is a singleton. We restrict each Ai 
to be of the same dimension and in particular to correspond to contiguous square 'blocks' 
of lattice points of size k x k. In terms of the value of C in case 3., an exhaustive set of 
blocks would result in C = (m — k + 1) x (n — k + 1). In particular, we allow the collection 
of blocks {Ai} to overlap with one another. Finally, it is worth noting that in our context 
marginal distributions, p(y Ai \9), for index sets Ai, are rarely, if ever, available. Hence we 
don't consider marginal composite likelihoods in this context. 

Our interest here is to compare the performance of estimation of 6 using the conditional 
composite likelihood and especially to understand the statistical efficiency as the block size k 
increases. It should also be evident that the computational complexity of this approximation 
will increase dramatically with k, the size of the blocks. Consequently our interest here is 
to explore the trade-off between using a larger block size, with a smaller number of blocks 
Cin Q. 



3.1 Computing full-conditional distributions of Ai 

The conditional composite likelihood which we described above relies on evaluating 

, , m _ exp (9 so(y Ai ) + s^yAjly-Ai)) 

where 

soiVA,) = J2vj, sx{y Ai \y- Ai ) = ^^VWj- 

Also the normalising constant now includes the argument y Ai emphasising that it involves a 
summation over all possible realisations of sub-lattices defined on the set Ai and conditioned 
on the realised y~ Ai - First we describe an approach to compute the overall normalising 
constant for a lattice, without any conditioning on a boundary. 

Generalised recursions for computing the normalizing constant of general factorisable 
models such as the autologistic models have been propos ed by |Reeves and Pettitt (20 04) , 



generalising a result known for hidden Markov Models (e.g. Zucchini and Guttorp 1991 Scott 



2002). This method applies to autologistic lattices with a small number of rows, up to about 
20, and is based on an algebraic simplification due to the reduction in dependence arising 
from the Markov property. It applies to un-normalized likelihoods that can be expressed as 
a product of factors, each of which is dependent on only a subset of the lattice sites. We can 
write q(y\9) in factorisable form as 



i=i 
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where each factor depends on a subset y i = y i: yi+i, ■ ■ ■ , y%+ m of where m is defined to 
be the lag of the model. We may define each factor as 

QiiVi, P) = exp{9 yi + 9iyi(y l+ i + y i+m )} (5) 

for all i, except when i corresponds to a lattice point on the last row or last column, in which 
case y i+ i or yi +m , respectively, drops out of (JH]). 

As a result of this factorisation, the summation for the normalizing constant, 

n 

*W = £ II 

y »=i 

can be represented as 

= E 9n(Vn\0) E ^-l(^-ll^) ' ' ' E «l(t/ll^) ( 6 ) 

2/n J/n-1 2/1 

which can be computed much more efficiently than the straightforward summation over the 
2 n possible lattice realisations. Full details of a recursive algorithm to compute the above 



can be found in Reeves and Pettitt (2004) Note that this algorithm was extended in (Friel 
and Rue 2007) to also allow exact draws from p(y\9) 

The minimum lag representation for an autologistic lattice with a first order neighbour- 
hood occurs for r given by the smaller of the number of rows or columns in the lattice. Iden- 
tifying the number of rows with the smaller dimension of the lattice, the computation time 
increases by a factor of two for each additional row, but linearly for additional columns. It is 
straightforward to extend this algorithm to allow one to compute the normalising constant 
in Q, so that the summation is over the variables y^ and each factor involves conditioning 
on the set y-A v 

3.2 Bayesian composite likelihoods 

The focus of interest in Bayesian inference is the posterior distribution 

p(6\y) oc f{y\9)p{9). 

Our proposal here is to replace the true likelihood f(y\9) with a conditional composite 
likelihood approximation, leading us to focus on the approximated posterior distribution 

p*(9\y)KCL(y\6)p(6). 

Surprisingly, there is very little literature on the use of composite likelihoods in the Bayesian 



setting, although Pauli et al. (2011) present a discussion on the use of conditional composite 



likelihoods. Indeed this paper suggests, following (Lindsay 1988), that a composite likelihood 
should take the general form 



c 



f(y\9)^Hp( yAi \y Bi ,9r, (7) 



i=l 
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where u>j are positive weights. In all experiments carried out here, we assume that Wi = 1, and 
empirically we observe that non-calibrated composite likelihood leads to an approximated 
posterior distribution with substantially lower variability than the true posterior distribution, 
leading to overly precise precision about posterior parameters. 



4 Examples 

Here we simulated 20 realisations from a first-order Ising model all defined on a 16 x 16 lattice, 
with a single interaction parameter 9 = 0.4, which is close to the critical phase transition 
beyond which all realised lattices takes either value +1 or —1. This parameter setting is the 
most challenging for the Ising model, since realised lattices exhibit strong spatial correlation 
around this parameter value. For a lattice of this dimension it is possible to exactly calculate 
the normalising constant z(6). Using a fine grid of {(9j} values, the right hand side of: 

p(8i\y) ex i = l,...,n. 

can be evaluated exactly. Summing up the right hand side yields an estimate of the evidence, 
p(y), which is the constant of normalisation for the expression above and which in turn can 
be used to give a very precise estimate of p(0\y). This serves as a ground truth against 
which to compare with the posterior estimates of 9 using the various composite likelihood 
estimators. Exact calculation of z{9) and the approach described above to get a precise 



estimate of the evidence relies on algorithms developed in (Friel and Rue 2007). For this 
experiment, we choose a uniform [—10, 10] prior for 9 . 

In terms of MCMC implementation, 5000 iterations were used with a burn in period of 
1, 000 iterations for each dataset. Computation was carried out on a desktop PC with a 
3.33Ghz processor and with 4Gb of memory. Computation time for each of the different 
composite likelihoods was approximately constant and took 0.004 second of CPU time per 
iteration. This was achieved by exhaustively using all 3 x 3 blocks in the CCL 3 approximation, 
40% of all 4 x 4 blocks, 20% of all 5 x 5 blocks and 10% of all 6 x 6 blocks in the CCL 4 , CCL 5 
and CCL6 approximations, respectively. The results for simulations involving the 16 x 16 
lattices are displayed in Figure [Tj Here each of the conditional composite likelihood methods 
perform better than pseudo likelihood. Each of the composite likelihood approximations 
performed equally well, although the CC 4 , CCL 5 and CCL 6 display larger spread than CCL 3 . 

It is apparent from Table [T] that the estimated posterior variance of 9 for each of the 
approximations are generally lower than the true posterior variances. In fact the conditional 
composite likelihood approximations lead to posterior variance estimates than are smaller 
by a factor of 10. This strongly suggests that the conditional composite likelihoods need to 
be calibrated in some form. 

In a similar vein to the previous experiment, here we examined the performance of the 
various approximations on simulated data defined on larger 50 x 50 lattice. In this instance 
we can't analytically compute the true likelihood, however here we used the exchange algo- 



rithm (Murray, Ghahramani and MacKay 2006) to generate draws from the target posterior 
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CCL3 



CCL4 



CCL5 



CCL6 



Pseudo 



Figure 1: 16 x 16 lattices: Boxplot displaying the bias of the estimate of 6 for 20 datasets 
corresponding to each of 4 conditional composite likelihood approximations, CCL 3 , CCL 4 , 
CCL 5 and CCL 6 (corresponding to block size 3 x 3, 4 x 4, 5 x 5, 6 x 6, respectively) and 
also pseudolikelihood estimator. Note that the computational time for each of the composite 
likelihood approximations was held constant. 



CCL 3 CCL 4 CCL 5 CCL 6 Pseudo True 

2.1 x 1(T 4 3.0 x 10" 4 3.3 x 10~ 4 4.3 x 10~ 4 2.1 x 10~ 3 1.6 x 10~ 3 



Table 1: 16 x 16: Average posterior variance for 8 for each of 20 datasets. 
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CCL3 



CCL4 



CCL5 



CCL6 



Pseudo 



Figure 2: 50 x 50 lattices: Boxplot displaying the bias of the estimate of 6 for 20 datasets 
corresponding to each of 4 conditional composite likelihood approximations, CCL 3 , CCL 4 , 
CCL 5 and CCL 6 (corresponding to block size 3 x 3, 4 x 4, 5 x 5, 6 x 6, respectively) and 
also pseudolikelihood estimator. Note that the computational time for each of the composite 
likelihood estimator was held constant. 

from a very long MCMC run. The simulation study was otherwise similar in every other 
respect. Computation time was approximately 0.1 second per iteration of the MCMC algo- 
rithm using the various composite likelihood approximations. Here the performance of the 
conditional composite likelihood was again similar to each other, and better generally, in 
terms of lower bias, than posterior mean estimation using the pseudolikelihood approxima- 
tion. Here, similar to the previous experiment, we see in Table [2] that the posterior variance 
based on the various the conditional composite likelihood are considerably small, than that 
estimated by the exchange algorithm. 

CCL 3 CCL A CCL 5 CCL % Pseudo Tr^e" 
1.5 x 10~ 3 3.3 x 10~ 5 2.2 x 10" 5 2.6 x 10~ 5 2.1 x 10" 4 1.3 x 10~ 2 

Table 2: 50 x 50: Average posterior variance for 9 for each of 20 datasets. 



4.1 Why is the posterior variance of estimators based on compos- 
ite likelihoods overly precise? 

The results of this section suggest that using conditional composite likelihoods leads to 
considerably underestimated posterior variances. A possible explanation for this behaviour 
may be due to a type of 'annealing' effect where the true likelihood is replaced by a powered 
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version of it, leading to an overly concentrated likelihood function. Here the true likelihood 
f(y\9) is replaced by rii=i PilJAi \UA\Ai-, 0) Wi . Suppose that Wi — 1 for all i (as is the case in all 
of the experiments carried out here) and suppose further that C is large, whereby potentially 
many blocks overlap. In this scenario the set of interactions in the true likelihood will be a 
subset of all the interactions in the conditional composite likelihood, due to the overlapping 
blocks, and this will in turn lead to an annealing of the true likelihood function. 

5 Conclusion 

This paper has illustrated the important role that composite likelihood approximations can 
play in the statistical analysis of Gibbs random fields, and in particular in the Ising and 
autologistic models in spatial statistics. This paper has focused on the use of conditional 
composite likelihoods, based on tractable full-conditional distributions over blocks of lattices 
points and shows much promise. However it is evident that the posterior distribution of 
the interaction parameter 9 is too concentrated and therefore underestimates the posterior 
variance. An important research question is to ask how to correctly calibrate the conditional 
composite likelihood so that it achieves the correct variance. 

The computational complexity of this approximation increases dramatically as the size of 
blocks increases and our study here shows that efficient parameter estimation can result by 
considering conditional composite likelihoods based on a subset of possible blocks, thereby 
reducing computation time. For future work it would be interesting to understand how 
composite likelihoods can be usefully employed for more challenging Markov random fields 
models. 
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